Bantry Submission - R notebook

This notebook is arranged in the order of the presentation slides and is intended to be read along side it.

It provides code used to generate findings and plots. It also goes further than what is presented, with additional plots and stats.

A couple of the plots in the presentation were generated using the data here but in Tableau. The datasets are provided here.

If you have any questions - feel free to reach out to Ronen (ronenbecker@gmail.com)

Load Relevant Libaries

In [43]:
library(ggplot2)
library(dplyr)
library(gridExtra)
library(lubridate)
library(RColorBrewer)
library(reshape2)
library(tidyr)
library(ggcorrplot)
library(repr)
library(treemap)

Slide 4:

Slide4: few details about our approach

In [44]:
Datathon <- read.csv("~/Google Drive/Data Science/Datathon for Social Good/datathon-social-good-master/Datathon data/Internal data track/Datathon.csv")
In [45]:
# How we analysed it slide
      # -----------------------------  #  Hub Perspective :
      ## Convert to date format
      Datathon$ActivityDate <- as.character(Datathon$ActivityDate)
      Datathon$ActivityDate<- dmy(Datathon$ActivityDate)
      
      #hub category aggregate Table
      hub_cat_agg <- Datathon %>% group_by(HubRandomID) %>%
        summarise(PA = sum(CategoryName == "Programmed Activities"),  
                  EAT = sum(CategoryName == "Education and Training"),
                  OOE = sum(CategoryName == "One Off events"),
                  SR =  sum(CategoryName == "Service Referrals"),
                  Min_DT = min(ActivityDate),
                  Max_DT = max(ActivityDate),
                  Duration = Max_DT - Min_DT,
                  duration_mnths = as.numeric((Duration/365))*12
        ) 
      
      #hub program (short name) aggregate Table
      
      hub_prog_agg <- Datathon %>% group_by(HubRandomID, ShortName ) %>% summarise(cnt = n())
      hub_prog_agg_corr <-  pivot_wider(hub_prog_agg ,names_from = ShortName, values_from = c(cnt), values_fill = list(cnt = 0))
      
      
      
      
      hub_cat_agg %>% ggplot(aes(x=duration_mnths)) + 
        xlab("Hub Running Duration in Months") +
        ylab("Count") +
        ggtitle("Hub Running Duration in Months", subtitle = "Distribution of Hub Durations for all Hubs") +
        geom_histogram(fill ="#377EB8", bins = 7, alpha = 0.7) +
        theme(
          plot.title = element_text(color="#4DAF4A", size=22, face="bold.italic"),
          axis.title.x = element_text(color="#E41A1C", size=14, face="bold"),
          axis.title.y = element_text(color="#E41A1C", size=14, face="bold"),
          axis.text=element_text(size=16)
        )

Slide 8

Slide8: Activities Across CHA

Data Prep:

  • Get the data to be based on a per month basis to make each hub comparible given their different running times.
  • Remove records that are not relevant to the analysis such as activities that ran with 0 attendees.
In [46]:
# Load the Data
Datathon <- read.csv("~/Google Drive/Data Science/Datathon for Social Good/datathon-social-good-master/Datathon data/Internal data track/Datathon.csv")
In [47]:
Datathon$ActivityDate <- as.character(Datathon$ActivityDate)
Datathon$ActivityDate<- dmy(Datathon$ActivityDate)

### Get a long table for all SRS for all Hubs
hub_pa_agg_mth <- Datathon %>% 
                    #### Remove all Activities with 0 attendees unless they are Service Referrals:
                    filter((CategoryName != "Service Referrals" & (ChildParticipants != 0 | AdultParticipants !=0 )) | CategoryName == "Service Referrals" ) %>%
                    #### Remove all SR with 'Number of Families Participating'
                    filter(ShortName != "Number of families participating in the Hub") %>% 
                    group_by(HubRandomID, CategoryName ) %>%
                    summarise(cnt = n())

# get min and max dates for each hub and join back to table
hub_dts <- Datathon %>% group_by(HubRandomID) %>% summarise(min_dt = min(ActivityDate),
                                                            max_dt = max(ActivityDate),
                                                            duration = max_dt - min_dt)

hub_pa_agg_mth <- hub_pa_agg_mth %>% inner_join(hub_dts, by= c("HubRandomID" = "HubRandomID"))
# Join maxmin dates back to table:
hub_pa_agg_mth <- hub_pa_agg_mth %>% mutate(duration_mnth = as.numeric((duration/365)*12), 
                        avg_cnt_mnth = cnt/ duration_mnth    # PAs per month                   
                       )

# Create a wide table useful for analysis 
hub_pa_agg_mth_wide <-  pivot_wider(hub_pa_agg_mth ,names_from = CategoryName, values_from = c(cnt, avg_cnt_mnth), values_fill = list(cnt = 0, avg_cnt_mnth =0)) 

### Rename columns:
hub_pa_agg_mth_wide <- hub_pa_agg_mth_wide %>% rename(
                                                      cnt_EAT = "cnt_Education and Training",
                                                      cnt_OOE = "cnt_One Off events",
                                                      cnt_PA = "cnt_Programmed Activities",
                                                      cnt_SR = "cnt_Service Referrals",
                                                      # avg_cnt_mth
                                                      avg_cnt_mth_EAT = "avg_cnt_mnth_Education and Training",
                                                      avg_cnt_mth_OOE = "avg_cnt_mnth_One Off events",
                                                      avg_cnt_mth_PA = "avg_cnt_mnth_Programmed Activities",
                                                      avg_cnt_mth_SR = "avg_cnt_mnth_Service Referrals"
                                                    ) #For renaming tibble column using dplyrpipe
                                           #operator

### Calculate create feature as proportion of month:
hub_pa_agg_mth_wide <- hub_pa_agg_mth_wide %>% mutate( total_avg_cnt_mth = avg_cnt_mth_EAT+avg_cnt_mth_OOE+avg_cnt_mth_PA+avg_cnt_mth_SR,
                                PA_avg_prop = avg_cnt_mth_PA / total_avg_cnt_mth,
                                EAT_avg_prop = avg_cnt_mth_EAT/ total_avg_cnt_mth,
                                OOE_avg_prop = avg_cnt_mth_OOE/ total_avg_cnt_mth,
                                SR_avg_prop = avg_cnt_mth_SR/ total_avg_cnt_mth,
                                SR_PA_ratio = avg_cnt_mth_SR/avg_cnt_mth_PA)

Slide8: What is the breakdown of each activity by category type:

  • This section compares hubs by examining
    • a. The average number of activities run each month by category as an average of all hubs.
    • b. The proportion of activities run each month broken down by category types as an average of all hubs.

Slide8: Average Activity Cat Count per Month (Left Bubble Chart):

In [48]:
hub_pa_agg_mth_wide %>% ungroup() %>% summarise(mean_OOE = mean(avg_cnt_mth_OOE),
                                                mean_PA = mean(avg_cnt_mth_PA),
                                                mean_SR = mean(avg_cnt_mth_SR),
                                                mean_EAT = mean(avg_cnt_mth_EAT)                                               
                           )
A tibble: 1 × 4
mean_OOEmean_PAmean_SRmean_EAT
<dbl><dbl><dbl><dbl>
0.00592813618.8050517.201913.660151

Approach:

  • We take the mean of the all the hubs average cnt per month of each activity category

Observations:

  • We can see that on monthly average hubs
    • run approx 19 PA
    • make approx 18 Service Referrals
    • Run 3 EAT activities

Slide 8: Proportion of Activities Per Month by Cat Proportions (Right Bubble Chart):

In [49]:
hub_pa_agg_mth_wide %>% ungroup() %>% summarise(mean_OOE = mean(OOE_avg_prop),
                                                mean_PA = mean(PA_avg_prop),
                                                mean_SR = mean(SR_avg_prop),
                                                mean_EAT = mean(EAT_avg_prop)                                               
                           )
A tibble: 1 × 4
mean_OOEmean_PAmean_SRmean_EAT
<dbl><dbl><dbl><dbl>
0.00019411340.48102340.42764030.09114217

Approach:

  • Initially we look at the activities based on their categories.
  • Taking the average number of each activity based on their category for each month that a hub was running
  • We then look at it as a proportion of all other category activities for that month (out of OOE, PA, SR, EAT)
  • We calculated these for each hub
  • We then took the mean ofthese proportions of all hubs to come up with the values above.
  • These tells the typical proportions of each activity based on category.

Observations:

  • They tells us that PAs and SR categories make up most of the activites each month.
  • Because these were averages other metrics they will not add up to 1.
  • The plot below depcits the individual breakdowns of these for each hub.

Slide 9

Slide 9: Visualising the proportions of each category activity for all hubs:

In [50]:
hub_cat_agg %>% select("HubRandomID", "Programmed Activities" = "PA","Education and Training" = "EAT","One Off events"= "OOE","Service Referrals" ="SR" ) %>%
          pivot_longer(cols = c(-HubRandomID), names_to = "Category")  %>%     # put all colums other than HubRandomID into Category
          ggplot(aes(x= factor(HubRandomID), y = value ,fill = factor(Category) )) +
          ggtitle("Hubs by Program Category Types", subtitle = "Category Proportion for Each Hub") +
          geom_bar(position = "fill", stat="identity" , alpha = 0.9 ) +
          theme(axis.text.x=element_blank(), 
                axis.ticks.x=element_blank(), 
                legend.title =element_blank(),
                plot.title = element_text(color="#D95F02", size=22, face="bold.italic"),
                axis.title.x = element_text(color="#D53E4F", size=14, face="bold"),
                axis.title.y = element_text(color="#D53E4F", size=14, face="bold"),
                axis.text=element_text(size=16)
                ) +
          scale_x_discrete(expand = c(0,0)) +
          scale_y_continuous(expand = c(0,0)) +
          ylab ("Proportion of Activities")  +
          xlab ("Hubs") +
          scale_fill_brewer(palette = "Set1", type = "div", direction = -1)
options(repr.plot.width=30, repr.plot.height=15)

Observations:

  • It is clear that some hubs are not making any service referrals
  • It is also clear that some hubs are making disproportionate service referrals
  • EAT and OOE make up a very small proportion of activities. Given this, our analysis will focus only on PA and SRs

Slide 10

Slide 10A: Distribution of average number of PA per hub per month (refers to first sentence):

In [52]:
hub_pa_agg_mth %>% filter(CategoryName == "Programmed Activities")  %>% 
ggplot(aes(x=avg_cnt_mnth)) +
xlab("Average PA per Hub per Month") +
ggtitle("Average PA Per Hub per Month", subtitle = "Hub Distribution")+
geom_dotplot(binpositions="all",stackgroups = TRUE,binwidth = 0.7, aes(fill = "light green") )+
theme(legend.position = "none",
     axis.text=element_text(size=16)
     ) 
options(repr.plot.width=30, repr.plot.height=6)
In [53]:
hub_pa_agg_mth %>% ungroup() %>% filter(CategoryName == "Programmed Activities")  %>% 
 summarise(min = min(avg_cnt_mnth),
                          q1 = quantile(avg_cnt_mnth,0.25),
                          mean = mean(avg_cnt_mnth),
                          median = median(avg_cnt_mnth),
                          q3 = quantile(avg_cnt_mnth,0.75),
                          max = max(avg_cnt_mnth),
                          sd = sd(avg_cnt_mnth),
                          IQR = IQR(avg_cnt_mnth),
                          LowOutlier= (q1 - 1.5*IQR),
                          HighOutliet = (q3 + 1.5*IQR),
                          latest_open_dt = max(min_dt),
                          latest_activity_dt = max(max_dt),
                          LowOutlier= (q1 - 1.5*IQR),
                          HighOutliet = (q3 + 1.5*IQR)
                           )
A tibble: 1 × 12
minq1meanmedianq3maxsdIQRLowOutlierHighOutlietlatest_open_dtlatest_activity_dt
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><date><date>
0.351637811.4765518.8050517.3431426.3049740.328819.5343814.82842-10.7660848.54762019-06-282019-12-20

Approach:

  • We take the average PA that each hub runs a month

Observations:

  • Typically, each hub runs about 19 Program Activities per month
  • There are a number of hubs who run less than 10. Although these are not outliers - one may want to investigate these, as these numbers are quite low.
In [14]:
hub_pa_agg_mth %>% ungroup() %>% filter(CategoryName == "Programmed Activities")  %>% filter (avg_cnt_mnth <= 10) %>% arrange(avg_cnt_mnth)
A tibble: 14 × 8
HubRandomIDCategoryNamecntmin_dtmax_dtdurationduration_mnthavg_cnt_mnth
<int><fct><int><date><date><drtn><dbl><dbl>
10932Programmed Activities 22019-06-282019-12-18 173 days 5.6876710.3516378
18200Programmed Activities1942017-01-232019-12-111052 days34.5863015.6091572
18812Programmed Activities2542016-04-252019-12-061320 days43.3972605.8529040
12756Programmed Activities2082017-01-232019-12-101051 days34.5534256.0196638
12736Programmed Activities1392018-01-292019-12-19 689 days22.6520556.1363087
12420Programmed Activities 722019-01-292019-12-19 324 days10.6520556.7592593
11662Programmed Activities2512017-01-232019-12-131054 days34.6520557.2434377
16724Programmed Activities2942016-02-012019-04-081162 days38.2027407.6957831
10011Programmed Activities2512017-04-172019-12-09 966 days31.7589047.9032954
17604Programmed Activities2912017-04-172019-12-05 962 days31.6273979.2008836
13876Programmed Activities3312017-01-232019-12-131054 days34.6520559.5521031
13958Programmed Activities3092017-04-172019-12-12 969 days31.8575349.6994324
11419Programmed Activities4582016-02-012019-12-121410 days46.3561649.8800236
15301Programmed Activities 732019-05-032019-12-12 223 days 7.3315079.9570254

Slide 10B: Distribution of average number of PA per hub per month for each PA type (Plot)

This sections is a more detailed look at the above, looking at the average per month based on PA types.

In [54]:
### Get a long table for all SRS for all Hubs

hub_pa_agg <- Datathon %>% filter(CategoryName == "Programmed Activities" & (ChildParticipants != 0 | AdultParticipants !=0 )) %>% group_by(HubRandomID, ShortName ) %>% summarise(
    cnt = n(),
    cnt_child = sum(ChildParticipants),
    cnt_adult = sum(AdultParticipants),
    avg_cnt_pa = ((cnt_child + cnt_adult) / cnt), # avg attendance per session
    avg_child_pa = cnt_child/cnt, # average child attendance per session
    avg_adult_pa = cnt_adult/cnt) # average adult attendance per session

# get min and max dates for each hub and join back to table
#hub_dts <- Datathon %>% group_by(HubRandomID) %>% summarise(min_dt = min(ActivityDate),
#                                                            max_dt = max(ActivityDate),
#                                                            duration = max_dt - min_dt
#                                                           ) 
# Join maxmin dates back to table:
hub_pa_agg <- hub_pa_agg %>% inner_join(hub_dts, by= c("HubRandomID" = "HubRandomID"))
# Develop new metrics - avg per month:
hub_pa_agg <- hub_pa_agg %>% mutate(duration_mnth = as.numeric((duration/365)*12), 
                        avg_cnt_mnth = cnt/ duration_mnth                      
                       )
# Develop wide table with column for each SR based on avg cnt per month
hub_pa_agg_wide <- hub_pa_agg %>% select(HubRandomID, ShortName, min_dt, max_dt, duration_mnth,avg_cnt_mnth ) %>% pivot_wider(names_from = ShortName, values_from = c(avg_cnt_mnth), values_fill = list(cnt = 0))
In [56]:
hub_pa_agg %>% ggplot(aes( x= factor(ShortName), y=avg_cnt_mnth, fill = factor(ShortName))) + 
      coord_flip()  +
      theme(legend.position = "none") +
      xlab("Program Activities") +
      ylab("Average Per Month") +
      ggtitle("Average Program Activities Per Month", subtitle = "Hub Distribution") +
      #theme(axis.title.x=element_blank(), axis.text.y=element_blank(), axis.ticks.y=element_blank()) +
      theme(
           # axis.title.x=element_blank(), 
            #axis.text.y=element_blank(), 
            #axis.ticks.y=element_blank(),
            legend.title =element_blank(),
            plot.title = element_text(color="#D95F02", size=22, face="bold.italic"),
            axis.title.x = element_text(color="#D53E4F", size=14, face="bold"),
            axis.title.y = element_text(color="#D53E4F", size=14, face="bold"),
            axis.text=element_text(size=16)
      ) +
      #scale_x_discrete(expand = c(0,0)) +
      #scale_y_discrete(expand = c(0,0)) +
      scale_fill_brewer(palette = "Set1", type = "div", direction = 1) +
      geom_boxplot(alpha =0.9)    
    options(repr.plot.width=30, repr.plot.height=10)

Slide 10: Alternate Plots:

In [18]:
hub_pa_agg  %>% 
ggplot(aes(x=avg_cnt_mnth)) +
xlab("Average Count Per Month") +
ggtitle("Average PA Per Month", subtitle = "Hub Distribution")+
geom_dotplot(binpositions="all",stackgroups = TRUE,binwidth = 0.1, aes(fill =factor(ShortName)) )+
facet_grid(rows = vars(ShortName)) +
theme(legend.position = "none",
     axis.text=element_text(size=16)
     ) 
options(repr.plot.width= 30, repr.plot.height=12)

Slide 10: Stats

In [19]:
hub_pa_stats<-  hub_pa_agg %>% group_by(ShortName) %>% summarise(min = min(avg_cnt_mnth),
                          q1 = quantile(avg_cnt_mnth,0.25),
                          mean = mean(avg_cnt_mnth),
                          median = median(avg_cnt_mnth),
                          q3 = quantile(avg_cnt_mnth,0.75),
                          max = max(avg_cnt_mnth),
                          sd = sd(avg_cnt_mnth),
                          IQR = IQR(avg_cnt_mnth),
                          latest_open_dt = max(min_dt),
                          latest_activity_dt = max(max_dt),
                          LowOutlier= (q1 - 1.5*IQR),
                          HighOutliet = (q3 + 1.5*IQR)
                           ) %>% arrange(desc(IQR))
hub_pa_stats
A tibble: 5 × 13
ShortNameminq1meanmedianq3maxsdIQRlatest_open_dtlatest_activity_dtLowOutlierHighOutliet
<fct><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><date><date><dbl><dbl>
Lifestyle 0.062974471.42996154.2241103.2520605.78093218.658153.8687964.3509702019-05-032019-12-20-5.09649412.307387
Playgroups 0.578813832.81476245.2504034.9937816.63479713.722872.8102073.8200352019-05-032019-12-20-2.91529012.364849
Parents 0.353168842.92630984.7791684.3927826.74400516.409562.7605173.8176952019-05-032019-12-20-2.80023312.470547
ChildLit 0.021556820.41598812.3437931.4987923.19625910.234452.5738602.7802712019-06-282019-12-20-3.754418 7.366665
EnglishClasses0.220410631.98493383.4953563.4623964.72188910.174742.0444932.7369552019-05-032019-12-20-2.120499 8.827321

Note: For these plots we excluded PA without attendees

  • On average there are 5 Playgroups and Parents program activities run a month
  • On average there are 4 lifestyle program activities run a month
  • On average there are 3.4 english classes program activities run a month
  • On average there are 2.3 Child Lit program activities run a month

There a some outlier hubs who run more than what is typical a month:

In [20]:
hub_pa_agg %>% filter(
                      (ShortName %in% c("Lifestyle","Playgroups", "Parents" ) & avg_cnt_mnth> 12 ) |
                      (ShortName == "ChildLit" & avg_cnt_mnth >7.3) |
                      (ShortName == "EnglishClasses" & avg_cnt_mnth > 8.8)
                     ) %>% arrange(ShortName)
A grouped_df: 11 × 13
HubRandomIDShortNamecntcnt_childcnt_adultavg_cnt_paavg_child_paavg_adult_pamin_dtmax_dtdurationduration_mnthavg_cnt_mnth
<int><fct><int><int><int><dbl><dbl><dbl><date><date><drtn><dbl><dbl>
11491ChildLit 451 6881 1507 18.59867 15.257206 3.3414632016-02-012019-12-201418 days46.61918 9.674130
14082ChildLit 342 2508 1357 11.30117 7.333333 3.9678362016-02-012019-12-201418 days46.61918 7.336037
14457ChildLit 463 6138 1082 15.59395 13.257019 2.3369332016-02-012019-12-201418 days46.61918 9.931535
18922ChildLit 357 9790 2055 33.17927 27.422969 5.7563032017-01-232019-12-201061 days34.8821910.234449
15994EnglishClasses473 2214 4214 13.58985 4.680761 8.9090912016-02-012019-12-161414 days46.4876710.174741
10833Lifestyle 58323776 6639 52.16981 40.78216111.3876502016-01-272019-12-191422 days46.7506812.470406
15669Lifestyle 192 5562 1753 38.09896 28.968750 9.1302082016-01-272016-12-05 313 days10.2904118.658147
18922Lifestyle 42767902 1378162.24824159.021077 3.2271662017-01-232019-12-201061 days34.8821912.241203
19483Lifestyle 78539575 5196 57.03312 50.414013 6.6191082016-01-272019-12-051408 days46.2904116.958156
13590Parents 7651201314932 35.22222 15.70326819.5189542016-02-012019-12-201418 days46.6191816.409556
11833Playgroups 194 2046 1160 16.52577 10.546392 5.9793812018-10-152019-12-19 430 days14.1369913.722868

Slide 11

Slide 11: Average attendance per PA for each hub

In [57]:
Datathon <- read.csv("~/Google Drive/Data Science/Datathon for Social Good/datathon-social-good-master/Datathon data/Internal data track/Datathon.csv")
In [58]:
### Get a long table for all PAs for all Hubs
hub_agg <- Datathon %>% filter(CategoryName == "Programmed Activities"  & (ChildParticipants != 0 | AdultParticipants !=0 ))%>% group_by(HubRandomID) %>% summarise(
    cnt = n(),
    cnt_child = sum(ChildParticipants),
    cnt_adult = sum(AdultParticipants),
    avg_cnt = ((cnt_child + cnt_adult) / cnt), # avg attendance per session
    avg_child = cnt_child/cnt, # average child attendance per session
    avg_adult = cnt_adult/cnt) # average adult attendance per session

# get min and max dates for each hub and join back to table
#hub_dts <- Datathon %>% group_by(HubRandomID) %>% summarise(min_dt = min(ActivityDate),
#                                                            max_dt = max(ActivityDate),
#                                                            duration = max_dt - min_dt
#                                                           ) 
# Join maxmin dates back to table:
hub_agg <- hub_agg %>% inner_join(hub_dts, by= c("HubRandomID" = "HubRandomID"))
# Develop new metrics - avg per month:
hub_agg <- hub_agg %>% mutate(duration_mnth = as.numeric((duration/365)*12), 
                        avg_cnt_mnth = cnt/ duration_mnth                      
                       )
# Develop wide table with column for each SR based on avg cnt per month
#hub_agg_wide <- hub_agg %>% select(HubRandomID, min_dt, max_dt, duration_mnth,avg_cnt_mnth ) %>% pivot_wider(names_from = ShortName, values_from = c(avg_cnt_mnth), values_fill = list(cnt = 0))
In [59]:
hub_agg  %>% ggplot(aes( x= 1, y=avg_cnt)) + 
  coord_flip()  +
  theme(legend.position = "none") +
  xlab("HubID") +
  ylab("Average Attendance Per Session") +
  ggtitle("Average Attendance Per Session", subtitle = "Hub Distribution") +
   ylim(0, 35) +
  #theme(axis.title.x=element_blank(), axis.text.y=element_blank(), axis.ticks.y=element_blank()) +
  geom_boxplot()

options(repr.plot.width=30, repr.plot.height=5)
Warning message:
“Removed 12 rows containing non-finite values (stat_boxplot).”
In [25]:
hub_agg  %>% 
ggplot(aes(x=avg_cnt)) +
xlab("Average Attendance per Hub") +
ggtitle("Average Attendance Per Hub", subtitle = "Hub Distribution")+
geom_dotplot(binpositions="all",stackgroups = TRUE,binwidth = 0.7, aes(fill = "light green") )+
theme(legend.position = "none",
     axis.text=element_text(size=16)
     ) 
options(repr.plot.width=30, repr.plot.height=6)
In [26]:
hub_agg_stats<-  hub_agg %>% summarise(min = min(avg_cnt),
                          q1 = quantile(avg_cnt,0.25),
                          mean = mean(avg_cnt),
                          median = median(avg_cnt),
                          q3 = quantile(avg_cnt,0.75),
                          max = max(avg_cnt),
                          sd = sd(avg_cnt),
                          IQR = IQR(avg_cnt),
                          LowOutlier= (q1 - 1.5*IQR),
                          HighOutliet = (q3 + 1.5*IQR),
                          latest_open_dt = max(min_dt),
                          latest_activity_dt = max(max_dt),
                          LowOutlier= (q1 - 1.5*IQR),
                          HighOutliet = (q3 + 1.5*IQR)
                           ) 
hub_agg_stats
A tibble: 1 × 12
minq1meanmedianq3maxsdIQRLowOutlierHighOutlietlatest_open_dtlatest_activity_dt
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><date><date>
6.53260914.5295523.1321719.6840728.470776.195612.6137213.94115-6.38217849.382432019-06-282019-12-20
  • When we exclude PA with 0 attendees, the mean attendance per PA is 23.
  • Maybe its worth examining hubs with sub 10 attendees on average per session
  • The upper outlier is approximately 50.
    • There are 3 hubs who average 50 or more attendees per session. Perhaps these hubs are overcrowded?
In [60]:
hub_agg %>% filter(avg_cnt >= 50)
A tibble: 3 × 12
HubRandomIDcntcnt_childcnt_adultavg_cntavg_childavg_adultmin_dtmax_dtdurationduration_mnthavg_cnt_mnth
<int><int><int><int><dbl><dbl><dbl><date><date><drtn><dbl><dbl>
123691324527321464250.8867139.8277911.0589122016-02-012019-12-201418 days46.6191828.400329
17604 291 7102 953557.1718224.4055032.7663232017-04-172019-12-05 962 days31.62740 9.200884
189221319891591134376.1956067.59591 8.5996972017-01-232019-12-201061 days34.8821937.812991
  • To see the list of hubs in order of attendance rates:

Slide 11: Average child attendance per PA for each hub:

In [323]:
hub_agg  %>% 
ggplot(aes(x=avg_child)) +
xlab("Average Attendance per Hub") +
ggtitle("Average Attendance Per Hub", subtitle = "Hub Distribution")+
geom_dotplot(binpositions="all",stackgroups = TRUE,binwidth = 0.7, aes(fill = "light green") )+
theme(legend.position = "none",
     axis.text=element_text(size=16)
     ) 
options(repr.plot.width=30, repr.plot.height=6)
In [324]:
hub_agg_stats<-  hub_agg %>% summarise(min = min(avg_child),
                          q1 = quantile(avg_child,0.25),
                          mean = mean(avg_child),
                          median = median(avg_child),
                          q3 = quantile(avg_child,0.75),
                          max = max(avg_child),
                          sd = sd(avg_child),
                          IQR = IQR(avg_child),
                          LowOutlier= (q1 - 1.5*IQR),
                          HighOutliet = (q3 + 1.5*IQR),
                          latest_open_dt = max(min_dt),
                          latest_activity_dt = max(max_dt),
                          LowOutlier= (q1 - 1.5*IQR),
                          HighOutliet = (q3 + 1.5*IQR)
                           ) 
hub_agg_stats
A tibble: 1 × 12
minq1meanmedianq3maxsdIQRLowOutlierHighOutlietlatest_open_dtlatest_activity_dt
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><date><date>
0.81521746.85067713.5942510.0557417.1591667.5959110.6877610.30848-8.61204432.621882019-06-282019-12-20
  • When we exclude PA with 0 attendees, the mean child attendance per PA is 13.
  • Maybe its worth examining hubs with sub 5 attendees on average per session
  • The upper outlier is approximately 32 children per session.
    • There are 4 hubs who average 32 or more attendees per session
In [61]:
hub_agg %>% filter(avg_child >= 32)
A tibble: 4 × 12
HubRandomIDcntcnt_childcnt_adultavg_cntavg_childavg_adultmin_dtmax_dtdurationduration_mnthavg_cnt_mnth
<int><int><int><int><dbl><dbl><dbl><date><date><drtn><dbl><dbl>
123691324527321464250.8867139.8277911.0589122016-02-012019-12-201418 days46.6191828.40033
13024 88734140 984249.5851238.4892911.0958292016-02-012019-12-091407 days46.2575319.17525
189221319891591134376.1956067.59591 8.5996972017-01-232019-12-201061 days34.8821937.81299
194831843621932326146.3667933.7455212.6212702016-01-272019-12-051408 days46.2904139.81386

Slide 11: Average adult attendance per PA for each hub:

In [62]:
hub_agg  %>% 
ggplot(aes(x=avg_adult)) +
xlab("Average Attendance per Hub") +
ggtitle("Average Attendance Per Hub", subtitle = "Hub Distribution")+
geom_dotplot(binpositions="all",stackgroups = TRUE,binwidth = 0.5, aes(fill = "light green") )+
theme(legend.position = "none",
     axis.text=element_text(size=16)
     ) 
options(repr.plot.width=30, repr.plot.height=6)
In [327]:
hub_agg_stats<-  hub_agg %>% summarise(min = min(avg_adult),
                          q1 = quantile(avg_adult,0.25),
                          mean = mean(avg_adult),
                          median = median(avg_adult),
                          q3 = quantile(avg_adult,0.75),
                          max = max(avg_adult),
                          sd = sd(avg_adult),
                          IQR = IQR(avg_adult),
                          LowOutlier= (q1 - 1.5*IQR),
                          HighOutliet = (q3 + 1.5*IQR),
                          latest_open_dt = max(min_dt),
                          latest_activity_dt = max(max_dt),
                          LowOutlier= (q1 - 1.5*IQR),
                          HighOutliet = (q3 + 1.5*IQR)
                           ) 
hub_agg_stats
A tibble: 1 × 12
minq1meanmedianq3maxsdIQRLowOutlierHighOutlietlatest_open_dtlatest_activity_dt
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><date><date>
2.6257.1996489.5379188.54607210.9934932.766324.8984323.7938371.50889316.684242019-06-282019-12-20
  • When we exclude PA with 0 attendees, the mean adult attendance per PA is 9 which lower than the child attendance of 13.
  • The upper outlier is approximately 16 adults per session.
    • There are 3 hubs who average 16 or more attendees per session.
    • Some of these are more than double and triple the mean. So worth investigating
In [63]:
hub_agg %>% filter(avg_adult >= 16)
A tibble: 3 × 12
HubRandomIDcntcnt_childcnt_adultavg_cntavg_childavg_adultmin_dtmax_dtdurationduration_mnthavg_cnt_mnth
<int><int><int><int><dbl><dbl><dbl><date><date><drtn><dbl><dbl>
11188762 48482422838.15748 6.36220531.795282016-02-012019-12-201418 days46.6191816.345205
16538996143681997534.4809214.42570320.055222016-02-012019-12-181416 days46.5534221.394774
17604291 7102 953557.1718224.40549832.766322017-04-172019-12-05 962 days31.62740 9.200884

Slide 11: Average attendance for each PA Types (Bottom Tree Map)

In [64]:
Datathon <- read.csv("~/Google Drive/Data Science/Datathon for Social Good/datathon-social-good-master/Datathon data/Internal data track/Datathon.csv")
In [10]:
### Get a long table for all SRS for all Hubs
hub_pa_agg <- Datathon %>% filter(CategoryName == "Programmed Activities"  & (ChildParticipants != 0 | AdultParticipants !=0 )) %>% group_by(HubRandomID, ShortName ) %>% summarise(
    cnt = n(),
    cnt_child = sum(ChildParticipants),
    cnt_adult = sum(AdultParticipants),
    avg_cnt_pa = ((cnt_child + cnt_adult) / cnt), # avg attendance per session
    avg_child_pa = cnt_child/cnt, # average child attendance per session
    avg_adult_pa = cnt_adult/cnt) # average adult attendance per session
    
# get min and max dates for each hub and join back to table
#hub_dts <- Datathon %>% group_by(HubRandomID) %>% summarise(min_dt = min(ActivityDate),
#                                                            max_dt = max(ActivityDate),
#                                                            duration = max_dt - min_dt
#                                                           ) 
# Join maxmin dates back to table:
hub_pa_agg <- hub_pa_agg %>% inner_join(hub_dts, by= c("HubRandomID" = "HubRandomID"))
# Develop new metrics - avg per month:
hub_pa_agg <- hub_pa_agg %>% mutate(duration_mnth = as.numeric((duration/365)*12), 
                        avg_cnt_mnth = cnt/ duration_mnth                      
                       )
# Develop wide table with column for each SR based on avg cnt per month
hub_pa_agg_wide <- hub_pa_agg %>% select(HubRandomID, ShortName, min_dt, max_dt, duration_mnth,avg_cnt_mnth ) %>% pivot_wider(names_from = ShortName, values_from = c(avg_cnt_mnth), values_fill = list(cnt = 0))
In [65]:
hub_pa_agg %>% ggplot(aes( x= factor(ShortName), y=avg_cnt_pa, fill = factor(ShortName))) + 
  coord_flip()  +
  theme(legend.position = "none") +
  xlab("Program Activities") +
  ylab("Average Attendance Per PA") +
  ggtitle("Average Attendance Per Program Activities", subtitle = "Hub Distribution") +
theme(axis.text=element_text(size=16)) +
  #theme(axis.title.x=element_blank(), axis.text.y=element_blank(), axis.ticks.y=element_blank()) +
  geom_boxplot() 
 options(repr.plot.width=30, repr.plot.height=15)
In [67]:
hub_pa_agg  %>% 
ggplot(aes(x=avg_cnt_pa)) +
xlab("Average Count Per Month") +
ggtitle("Average PA Per Month", subtitle = "Hub Distribution")+
geom_dotplot(binpositions="all",stackgroups = TRUE,binwidth = 1, aes(fill =factor(ShortName)) )+
facet_grid(rows = vars(ShortName)) +
theme(legend.position = "none",
     axis.text=element_text(size=16)
     ) 
options(repr.plot.width= 30, repr.plot.height=12)
In [68]:
hub_pa_stats<-  hub_pa_agg %>% group_by(ShortName) %>% summarise(min = min(avg_cnt_pa),
                          q1 = quantile(avg_cnt_pa,0.25),
                          mean = mean(avg_cnt_pa),
                          median = median(avg_cnt_pa),
                          q3 = quantile(avg_cnt_pa,0.75),
                          max = max(avg_cnt_pa),
                          sd = sd(avg_cnt_pa),
                          IQR = IQR(avg_cnt_pa),
                          latest_open_dt = max(min_dt),
                          latest_activity_dt = max(max_dt),
                          LowOutlier= (q1 - 1.5*IQR),
                          HighOutliet = (q3 + 1.5*IQR)
                           ) %>% arrange(desc(mean))
hub_pa_stats
A tibble: 5 × 13
ShortNameminq1meanmedianq3maxsdIQRlatest_open_dtlatest_activity_dtLowOutlierHighOutliet
<fct><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><date><date><dbl><dbl>
Lifestyle 2.83333312.78164634.0591821.6119844.71898162.2482432.96323631.9373322019-05-032019-12-20-35.12435392.62498
Parents 3.31578910.98433723.7403718.3769433.58360 74.3368416.95387422.5992662019-05-032019-12-20-22.91456267.48250
Playgroups 3.95454514.74074123.1166419.7791727.15000119.2424214.72325312.4092592019-05-032019-12-20 -3.87314845.76389
ChildLit 2.00000011.76923121.5269417.7279424.37083170.0000022.02919812.6016032019-06-282019-12-20 -7.13317343.27324
EnglishClasses1.888889 7.93149412.4897911.5134214.35926 66.75510 8.438513 6.4277652019-05-032019-12-20 -1.71015424.00091
  • There are 23 attendees on average per playgroup and parent engagement group.
  • There are 34 attendees on average per lifestyle group
  • There are 12 attendees on average per English class
  • There are 21 attendees on average per Child Lit class

Below are some outliers:

In [69]:
hub_pa_agg %>% filter(
                      (ShortName == "Playgroups" & avg_cnt_pa >45 |
                       ShortName == "Parents" & avg_cnt_pa >67 |
                       ShortName == "Lifestyle" & avg_cnt_pa >92 |
                       ShortName == "EnglishClasses" & avg_cnt_pa >24 |
                       ShortName == "ChildLit" & avg_cnt_pa >43
                     ))%>% arrange(ShortName)
A grouped_df: 15 × 13
HubRandomIDShortNamecntcnt_childcnt_adultavg_cnt_paavg_child_paavg_adult_pamin_dtmax_dtdurationduration_mnthavg_cnt_mnth
<int><fct><int><int><int><dbl><dbl><dbl><date><date><drtn><dbl><dbl>
15669ChildLit 70 3267 549 54.51429 46.67143 7.8428572016-01-272016-12-05 313 days10.29041 6.80244941
16999ChildLit 1 31 28 59.00000 31.00000 28.0000002016-02-012019-12-131411 days46.38904 0.02155682
17319ChildLit 1 66 104170.00000 66.00000104.0000002017-01-232019-12-181059 days34.81644 0.02872206
11491EnglishClasses 49 3219 52 66.75510 65.69388 1.0612242016-02-012019-12-201418 days46.61918 1.05106958
19483EnglishClasses209 28794319 34.44019 13.77512 20.6650722016-01-272019-12-051408 days46.29041 4.51497396
10884Lifestyle 91 73901689 99.76923 81.20879 18.5604402016-02-012019-12-041402 days46.09315 1.97426296
11491Lifestyle 195220333491130.89231112.98974 17.9025642016-02-012019-12-201418 days46.61918 4.18282793
13024Lifestyle 14921622 431148.00671145.11409 2.8926172016-02-012019-12-091407 days46.25753 3.22109690
18922Lifestyle 427679021378162.24824159.02108 3.2271662017-01-232019-12-201061 days34.8821912.24120327
19635Lifestyle 27931585 349114.45878113.20789 1.2508962016-02-012019-12-131411 days46.38904 6.01435152
10813Parents 95 36673395 74.33684 38.60000 35.7368422017-04-172019-12-17 974 days32.02192 2.96671800
11188Playgroups 128 32383214 50.40625 25.29688 25.1093752016-02-012019-12-201418 days46.61918 2.74565115
15830Playgroups 205 54344543 48.66829 26.50732 22.1609762017-07-102019-12-13 886 days29.12877 7.03771633
17604Playgroups 99 60145791119.24242 60.74747 58.4949492017-04-172019-12-05 962 days31.62740 3.13019751
19483Playgroups 268 75985202 47.76119 28.35075 19.4104482016-01-272019-12-051408 days46.29041 5.78953598

Slide 12

Slide 12: Average Adult Attendace per PA Type (Left Tree Map)

In [70]:
hub_pa_adult_stats<-  hub_pa_agg %>% group_by(ShortName) %>% summarise(min = min(avg_adult_pa),
                          q1 = quantile(avg_adult_pa,0.25),
                          mean = mean(avg_adult_pa),
                          median = median(avg_adult_pa),
                          q3 = quantile(avg_adult_pa,0.75),
                          max = max(avg_adult_pa),
                          sd = sd(avg_adult_pa),
                          IQR = IQR(avg_adult_pa),
                          latest_open_dt = max(min_dt),
                          latest_activity_dt = max(max_dt),
                          LowOutlier= (q1 - 1.5*IQR),
                          HighOutliet = (q3 + 1.5*IQR)
                                                                       
                           ) %>% arrange(desc(ShortName))
hub_pa_adult_stats
A tibble: 5 × 13
ShortNameminq1meanmedianq3maxsdIQRlatest_open_dtlatest_activity_dtLowOutlierHighOutliet
<fct><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><date><date><dbl><dbl>
Playgroups 1.40909096.32183910.314512 8.65853711.747253 58.49495 7.161484 5.4254142019-05-032019-12-20 -1.81628119.88537
Parents 2.60526327.48991014.39179311.96495619.500530 50.69632 9.68994012.0106192019-05-032019-12-20-10.52601937.51646
Lifestyle 0.63333333.485197 7.086186 6.451563 9.547549 19.33835 4.169396 6.0623522019-05-032019-12-20 -5.60833018.64108
EnglishClasses1.06122455.066402 8.033671 7.638427 9.799242 20.66507 3.831909 4.7328412019-05-032019-12-20 -2.03285916.89850
ChildLit 0.21052632.657143 7.490970 5.150259 7.867262104.0000013.231159 5.2101192019-06-282019-12-20 -5.15803615.68244

Slide 12: Average Child Attendace per PA Type (Right Tree Map)

In [71]:
hub_pa_child_stats<-  hub_pa_agg %>% group_by(ShortName) %>% summarise(min = min(avg_child_pa),
                          q1 = quantile(avg_child_pa,0.25),
                          mean = mean(avg_child_pa),
                          median = median(avg_child_pa),
                          q3 = quantile(avg_child_pa,0.75),
                          max = max(avg_child_pa),
                          sd = sd(avg_child_pa),
                          IQR = IQR(avg_child_pa),
                          latest_open_dt = max(min_dt),
                          latest_activity_dt = max(max_dt)
                           ) %>% arrange(desc(ShortName))
hub_pa_child_stats
A tibble: 5 × 11
ShortNameminq1meanmedianq3maxsdIQRlatest_open_dtlatest_activity_dt
<fct><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><date><date>
Playgroups 2.5454558.07894712.80213111.85416715.305122 60.74747 7.715675 7.2261752019-05-032019-12-20
Parents 0.0000003.076825 9.348573 5.59005611.084514 45.84456 9.834989 8.0076892019-05-032019-12-20
Lifestyle 0.0000003.52420926.97299613.31985438.880226159.0210833.00185835.3560182019-05-032019-12-20
EnglishClasses0.0000001.830145 4.456115 3.073932 5.028298 65.69388 7.710651 3.1981542019-05-032019-12-20
ChildLit 0.0000007.45000014.03597010.33333316.469097 66.0000010.989822 9.0190972019-06-282019-12-20

Slide 13:

Slide 13: SRs of Each Type on Average Per Month (Plot):

In [78]:
Datathon <- read.csv("~/Google Drive/Data Science/Datathon for Social Good/datathon-social-good-master/Datathon data/Internal data track/Datathon.csv")
In [79]:
### Convert Activity Date to Lubridate Date
Datathon$ActivityDate <- as.character(Datathon$ActivityDate)
Datathon$ActivityDate<- dmy(Datathon$ActivityDate)

### Get a long table for all SRS for all Hubs
hub_sr_agg <- Datathon %>% 
                            #### Remove all Activities with 0 attendees unless they are Service Referrals:
                            filter(CategoryName == "Service Referrals") %>%
                            #### Remove all SR with 'Number of Families Participating'
                            filter(ShortName != "Number of families participating in the Hub" &
                                   ShortName != 'Children'&
                                   ShortName != 'Adults' 
                                  ) %>% 
                            group_by(HubRandomID, ShortName ) %>%
                            summarise(cnt = n())

# get min and max dates for each hub and join back to table
hub_dts <- Datathon %>% group_by(HubRandomID) %>% summarise(min_dt = min(ActivityDate),
                                                            max_dt = max(ActivityDate),
                                                            duration = max_dt - min_dt
                                                           ) 
# Join maxmin dates back to table:
hub_sr_agg <- hub_sr_agg %>% inner_join(hub_dts, by= c("HubRandomID" = "HubRandomID"))
# Develop new metrics - avg per month:
hub_sr_agg <- hub_sr_agg %>% mutate(duration_mnth = as.numeric((duration/365)*12), 
                        avg_cnt_mnth = cnt/ duration_mnth                      
                       )
# Develop wide table with column for each SR based on avg cnt per month
hub_sr_agg_wide <- hub_sr_agg %>% select(HubRandomID, ShortName, min_dt, max_dt, duration_mnth,avg_cnt_mnth ) %>% pivot_wider(names_from = ShortName, values_from = c(avg_cnt_mnth), values_fill = list(cnt = 0))   
relevant_srs_wide <- hub_sr_agg_wide 
relevant_srs <- hub_sr_agg
In [80]:
colourCount = length(unique(relevant_srs $ShortName))
    getPalette = colorRampPalette(brewer.pal(9, "Set1"))    
    
        
    relevant_srs %>% ggplot(aes( x= factor(ShortName), y=avg_cnt_mnth, fill = factor(ShortName))) + 
      coord_flip()  +
      theme(legend.position = "none") +
      theme(
        # axis.title.x=element_blank(), 
        #axis.text.y=element_blank(), 
        #axis.ticks.y=element_blank(),
        legend.title =element_blank(),
        plot.title = element_text(color="#D95F02", size=22, face="bold.italic"),
        axis.title.x = element_text(color="#D53E4F", size=14, face="bold"),
        axis.title.y = element_text(color="#D53E4F", size=14, face="bold"),
        axis.text=element_text(size=16)
      ) +
      #scale_x_discrete(expand = c(0,0)) +
      #scale_y_discrete(expand = c(0,0)) +
#      scale_fill_brewer(colorRampPalette(brewer.pal(9, "Set1")), type = "div", direction = 1) +
      geom_boxplot(alpha =1)  +
      xlab("Service Referral") +
      ylab("Average Per Month") +
      ggtitle("Average Service Referrals Per Month", subtitle = "Hub Distribution") +
      theme(axis.title=element_text(size=12,face="bold"),
            axis.text=element_text(size=16),
      ) +
      #theme(axis.title.x=element_blank(), axis.text.y=element_blank(), axis.ticks.y=element_blank()) +
      geom_boxplot(fill=getPalette(colourCount))

options(repr.plot.width=30, repr.plot.height=15)

Slide 13: Alternate Plot

In [81]:
relevant_srs  %>% 
ggplot(aes(x=avg_cnt_mnth)) +
xlab("Average SR Per Month") +
ggtitle("Average SR Per Month", subtitle = "Hub Distribution")+
geom_dotplot(binpositions="all",stackgroups = TRUE,binwidth = 0.02, aes(fill =factor(ShortName)) )+
facet_grid(rows = vars(ShortName)) +
theme(legend.position = "none",
     axis.text=element_text(size=16),
     ) 
options(repr.plot.width=30, repr.plot.height=15)

Slide 13: SRs of Each Type on Average Per Month (Stats):

In [316]:
relevant_srs_stats<-  relevant_srs %>% group_by(ShortName) %>% summarise(min = min(avg_cnt_mnth),
                          q1 = quantile(avg_cnt_mnth,0.25),
                          mean = mean(avg_cnt_mnth),
                          median = median(avg_cnt_mnth),
                          q3 = quantile(avg_cnt_mnth,0.75),
                          max = max(avg_cnt_mnth),
                          sd = sd(avg_cnt_mnth),
                          IQR = IQR(avg_cnt_mnth),
                          latest_open_dt = max(min_dt),
                          latest_activity_dt = max(max_dt),
                          LowOutlier= (q1 - 1.5*IQR),
                          HighOutliet = (q3 + 1.5*IQR)
                           ) %>% arrange(desc(mean))
relevant_srs_stats
A tibble: 14 × 13
ShortNameminq1meanmedianq3maxsdIQRlatest_open_dtlatest_activity_dtLowOutlierHighOutliet
<fct><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><date><date><dbl><dbl>
Family 0.214655381.55575542.08836242.08068882.6571414.2441860.82238561.10138532019-05-032019-12-20-0.0963225454.309219
Education 0.429007991.35991361.79304411.83994962.2625713.8969200.73683830.90265752019-05-032019-12-20 0.0059273043.616557
Preschool 0.070736430.81081641.51785361.38585572.0324133.8969200.85137541.22159612019-05-032019-12-20-1.0215777563.864807
CHS 0.214504000.75370721.35848391.24500121.8119113.8871140.74069381.05820362019-05-032019-12-20-0.8335981743.399216
Intervention 0.214504000.68689721.34461381.33180321.8892343.8969200.78392841.20233682019-05-032019-12-20-1.1166079303.692739
Migrant 0.214655380.68641281.31233281.21204931.8232843.8969200.77856481.13687122019-05-032019-12-20-1.0188939823.528591
Violence 0.257586450.68288531.27362081.19836941.6214743.0692550.69035890.93858862019-05-032019-12-20-0.7249976153.029357
MatChildHealth 0.214504000.66123191.26324841.01388891.8892343.8969200.78893891.22800212019-05-032019-12-20-1.1807712223.731237
GPs 0.070736430.68660651.24291641.05589771.6133313.8969200.78120280.92672452019-05-032019-12-20-0.7034802863.003418
Other 0.214504000.51080381.22954831.18227331.6922463.8396220.75281701.18144242019-05-032019-12-20-1.2613598563.464410
Emergency 0.214504000.57881381.16071310.90602841.5810643.8871140.76889191.00225052019-05-032019-12-20-0.9245619643.084440
Finance 0.042961390.55085201.10670740.88356381.4178943.8969200.79213650.86704232019-05-032019-12-20-0.7497115512.718458
Accomm 0.213900610.43659160.96251750.75288781.2794213.8969200.66191310.84282942019-05-032019-12-20-0.8276525022.543665
Participants who gained employment0.021618100.35376430.87013940.71703601.1372893.0125130.71466910.78352442019-02-042019-12-20-0.8215223012.312575

Slide 14

Slide 14:Service Referrals Per PA (Plot)

In [82]:
Datathon <- read.csv("~/Google Drive/Data Science/Datathon for Social Good/datathon-social-good-master/Datathon data/Internal data track/Datathon.csv")
In [83]:
Datathon$ActivityDate <- as.character(Datathon$ActivityDate)
Datathon$ActivityDate<- dmy(Datathon$ActivityDate)

### Get a long table for all SRS for all Hubs
hub_pa_agg_mth <- Datathon %>% 
                    #### Remove all Activities with 0 attendees unless they are Service Referrals:
                    filter((CategoryName != "Service Referrals" & (ChildParticipants != 0 | AdultParticipants !=0 )) | CategoryName == "Service Referrals" ) %>%
                    #### Remove all SR with 'Number of Families Participating'
                    filter(ShortName != "Number of families participating in the Hub" &
                           ShortName != 'Children'&
                           ShortName != 'Adults' 
                          ) %>% 
                    group_by(HubRandomID, CategoryName ) %>%
                    summarise(cnt = n())

# get min and max dates for each hub and join back to table
hub_dts <- Datathon %>% group_by(HubRandomID) %>% summarise(min_dt = min(ActivityDate),
                                                            max_dt = max(ActivityDate),
                                                            duration = max_dt - min_dt)

hub_pa_agg_mth <- hub_pa_agg_mth %>% inner_join(hub_dts, by= c("HubRandomID" = "HubRandomID"))
# Join maxmin dates back to table:
hub_pa_agg_mth <- hub_pa_agg_mth %>% mutate(duration_mnth = as.numeric((duration/365)*12), 
                        avg_cnt_mnth = cnt/ duration_mnth                      
                       )

# Create a wide table useful for analysis 
hub_pa_agg_mth_wide <-  pivot_wider(hub_pa_agg_mth ,names_from = CategoryName, values_from = c(cnt, avg_cnt_mnth), values_fill = list(cnt = 0, avg_cnt_mnth =0)) 

### Rename columns:
hub_pa_agg_mth_wide <- hub_pa_agg_mth_wide %>% rename(
                                                      cnt_EAT = "cnt_Education and Training",
                                                      cnt_OOE = "cnt_One Off events",
                                                      cnt_PA = "cnt_Programmed Activities",
                                                      cnt_SR = "cnt_Service Referrals",
                                                      
                                                      # avg_cnt_mth
                                                      avg_cnt_mth_EAT = "avg_cnt_mnth_Education and Training",
                                                      avg_cnt_mth_OOE = "avg_cnt_mnth_One Off events",
                                                      avg_cnt_mth_PA = "avg_cnt_mnth_Programmed Activities",
                                                      avg_cnt_mth_SR = "avg_cnt_mnth_Service Referrals"
                                                    ) #For renaming tibble column using dplyrpipe
                                           #operator

### Calculate create feature as proportion of month:
hub_pa_agg_mth_wide <- hub_pa_agg_mth_wide %>% mutate( total_avg_cnt_mth = avg_cnt_mth_EAT+avg_cnt_mth_OOE+avg_cnt_mth_PA+avg_cnt_mth_SR,
                                PA_avg_prop = avg_cnt_mth_PA / total_avg_cnt_mth,
                                EAT_avg_prop = avg_cnt_mth_EAT/ total_avg_cnt_mth,
                                OOE_avg_prop = avg_cnt_mth_OOE/ total_avg_cnt_mth,
                                SR_avg_prop = avg_cnt_mth_SR/ total_avg_cnt_mth,
                                SR_PA_ratio = avg_cnt_mth_SR/avg_cnt_mth_PA                        
                                                     )
In [84]:
colourCount = length(unique(hub_pa_agg_mth_wide$HubRandomID))
hub_pa_agg_mth_wide  %>% 
      ggplot(aes(x=SR_PA_ratio)) +
      xlab("Average Service Referrals per Program Activity") +
      ylab("Proportion") +
      ggtitle("Average Service Referrals per Program Activity across all hubs", subtitle = "Hub Distribution")+
      geom_dotplot(binpositions="all",stackgroups = TRUE,binwidth = 0.08, fill=getPalette(colourCount))+
      theme(legend.position = "none",
            axis.text=element_text(size=16)
      ) +
      theme(legend.position = "none") +
      theme(
        # axis.title.x=element_blank(), 
        #axis.text.y=element_blank(), 
        #axis.ticks.y=element_blank(),
        legend.title =element_blank(),
        plot.title = element_text(color="#7FC97F", size=22, face="bold.italic"),
        axis.title.x = element_text(color="#D53E4F", size=14, face="bold"),
        axis.title.y = element_text(color="#D53E4F", size=14, face="bold"),
        axis.text=element_text(size=16)
      )+
    scale_y_continuous(NULL, breaks = NULL)
In [37]:
hub_pa_agg_mth_wide %>% ungroup() %>% summarise(min = min(SR_PA_ratio),
                          q1 = quantile(SR_PA_ratio,0.25),
                          mean = mean(SR_PA_ratio),
                          median = median(SR_PA_ratio),
                          q3 = quantile(SR_PA_ratio,0.75),
                          max = max(SR_PA_ratio),
                          sd = sd(SR_PA_ratio),
                          IQR = IQR(SR_PA_ratio),
                          latest_open_dt = max(min_dt),
                          latest_activity_dt = max(max_dt),
                          LowOutlier= (q1 - 1.5*IQR),
                          HighOutliet = (q3 + 1.5*IQR)
                           ) %>% arrange(desc(IQR))
A tibble: 1 × 12
minq1meanmedianq3maxsdIQRlatest_open_dtlatest_activity_dtLowOutlierHighOutliet
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><date><date><dbl><dbl>
00.56662531.0700410.91602451.5318443.4845360.70822770.96521832019-06-282019-12-20-0.88120222.979671
  • We can see that on average there is approximately 1.07 SR per PA run when whene excluding PAs with 0 attendees
  • There are a couple outliers hubs having more than 3 SR per PA.
In [8]:
hub_pa_agg_mth_wide %>% filter (SR_PA_ratio >3) %>% select(-PA_avg_prop, -EAT_avg_prop)
A grouped_df: 2 × 17
HubRandomIDmin_dtmax_dtdurationduration_mnthcnt_EATcnt_OOEcnt_PAcnt_SRavg_cnt_mth_EATavg_cnt_mth_OOEavg_cnt_mth_PAavg_cnt_mth_SRtotal_avg_cnt_mthOOE_avg_propSR_avg_propSR_PA_ratio
<int><date><date><drtn><dbl><int><int><int><int><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
182002017-01-232019-12-111052 days34.58630 50194 6760.14456590 5.60915719.5453125.2990300.77257143.484536
183152016-02-012018-07-02 882 days28.99726200044313556.8972033015.27730546.7285568.9030600.67817823.058691

Slide 16

Slide 16: Hierarchical Clustering to Determine Similarities Between Hubs

In [85]:
Datathon <- read.csv("~/Google Drive/Data Science/Datathon for Social Good/datathon-social-good-master/Datathon data/Internal data track/Datathon.csv")
In [86]:
### Get a long table for all SRS for all Hubs
hub_pa_agg <- Datathon %>% filter(CategoryName == "Programmed Activities" & (ChildParticipants != 0 | AdultParticipants !=0 )) %>% group_by(HubRandomID, ShortName ) %>% summarise(
    cnt = n(),
    cnt_child = sum(ChildParticipants),
    cnt_adult = sum(AdultParticipants),
    avg_cnt_pa = ((cnt_child + cnt_adult) / cnt), # avg attendance per session
    avg_child_pa = cnt_child/cnt, # average child attendance per session
    avg_adult_pa = cnt_adult/cnt) # average adult attendance per session

# get min and max dates for each hub and join back to table
#hub_dts <- Datathon %>% group_by(HubRandomID) %>% summarise(min_dt = min(ActivityDate),
#                                                            max_dt = max(ActivityDate),
#                                                            duration = max_dt - min_dt
#                                                           ) 
# Join maxmin dates back to table:
hub_pa_agg <- hub_pa_agg %>% inner_join(hub_dts, by= c("HubRandomID" = "HubRandomID"))
# Develop new metrics - avg per month:
hub_pa_agg <- hub_pa_agg %>% mutate(duration_mnth = as.numeric((duration/365)*12), 
                        avg_cnt_mnth = cnt/ duration_mnth                      
                       )
# Develop wide table with column for each SR based on avg cnt per month
hub_pa_agg_wide <- hub_pa_agg %>% select(HubRandomID, ShortName, min_dt, max_dt, duration_mnth,avg_cnt_mnth ) %>% pivot_wider(names_from = ShortName, values_from = c(avg_cnt_mnth), values_fill = list(avg_cnt_mnth = 0))

Slide 16: Similarity between hubs based on average PA types run per month

In [87]:
data_clust <- hub_pa_agg_wide %>% ungroup() %>%select(-HubRandomID,-min_dt,-max_dt, -duration_mnth)  ### Columns to exclude
row.names(data_clust) <- hub_pa_agg_wide$HubRandomID  #### Row names for the HCLUST
data_clust_dist <- dist(data_clust)  #### This creates a distance matrix from each column
cluster <- hclust(data_clust_dist)   ##### This is the command to create the heirchical cluster
dend <- as.dendrogram(cluster) #### This is the function to create the dendrogram to display different style diagram
#### Several different ways to present the heirarchical diagram:
#plot(cluster, main= "Program Activity Types Per Month Similarity Clustering", cex=1, hang = -1, axes = FALSE)
#plot(dend, main= "MSO vs Condition Clustering", cex=0.9, hang = -1, axes = FALSE)
#plot(dend, hang = -1, type = "triangle")
hub_pa_agg_wide [cluster$order,]
Warning message:
“Setting row names on a tibble is deprecated.”
A grouped_df: 80 × 9
HubRandomIDmin_dtmax_dtduration_mnthEnglishClassesLifestyleParentsPlaygroupsChildLit
<int><date><date><dbl><dbl><dbl><dbl><dbl><dbl>
156692016-01-272016-12-0510.290411 0.00000018.6581470 6.413738 8.454473 6.80244941
194832016-01-272019-12-0546.290411 4.51497416.9581558 6.113577 5.789536 6.43761837
108332016-01-272019-12-1946.750685 1.32618412.4704055 4.256622 5.967827 3.91438115
140822016-02-012019-12-2046.619178 4.74053811.5617654 5.898860 9.030618 7.33603667
149352016-02-012019-12-2046.619178 5.94176111.0255054 5.59855410.038787 4.50458392
123692016-02-012019-12-2046.619178 4.761989 9.4810766 2.445346 6.263517 5.44840150
142262016-02-012019-12-2046.619178 4.654737 8.5587094 3.131758 9.245122 3.47496474
189222017-01-232019-12-2034.882192 3.52615512.2412033 6.880302 4.93088310.23444863
135162016-01-272019-12-1146.487671 3.871994 7.8085219 7.550389 6.066124 7.14167845
165122019-05-032019-12-12 7.331507 2.045964 7.9110613 5.183109 6.547085 5.59230194
135902016-02-012019-12-2046.619178 3.861072 7.271685516.409556 8.751763 2.55259756
112272016-02-012019-12-1846.553425 5.821269 4.8761182 7.73305111.127001 0.25776836
169122017-10-022019-12-2026.597260 3.797384 4.6621343 7.933148 9.023486 0.56396786
181542016-02-012019-12-1346.389041 4.742499 7.7820104 9.096976 9.679010 3.66465863
192772016-02-012019-12-1646.487671 6.410302 5.4638142 9.077676 6.453324 2.58132956
123942017-07-102019-12-2029.358904 3.67861110.1161814 9.775569 5.824468 0.00000000
175342016-02-012019-12-2046.619178 5.555654 8.4085567 4.697638 3.646568 0.00000000
183352016-02-012019-12-2046.619178 3.861072 7.4218383 5.770157 3.496415 0.12870240
118332018-10-152019-12-1914.136986 6.012597 1.2025194 6.57848813.722868 0.42441860
113322016-02-012019-12-1846.553425 7.110111 3.8665254 4.145774 8.334510 1.05255414
159942016-02-012019-12-1646.48767110.174741 2.2801744 5.571370 9.357320 0.64533239
158302017-07-102019-12-1329.128767 6.488431 1.5105342 7.312359 7.037716 1.54486456
176002018-01-292019-12-2022.684932 6.347826 0.8816425 6.612319 3.173913 0.52898551
149152019-02-042019-12-1210.224658 4.596731 0.0000000 4.987942 5.281350 0.00000000
169992016-02-012019-12-1346.389041 5.906567 2.3065793 4.677829 5.885011 0.02155682
165382016-02-012019-12-1846.553425 3.716161 3.7591219 6.787900 6.164960 0.96663136
110982016-01-272019-12-0946.421918 4.696058 5.0407224 7.194877 6.634797 4.15751889
171822017-10-022019-12-1926.564384 3.049196 4.9314150 7.077145 4.329105 3.16212871
196352016-02-012019-12-1346.389041 3.405977 6.0143515 8.062249 5.324533 2.88861328
160502017-10-022019-12-1926.564384 2.446885 1.5057756 9.599319 5.270215 2.78568482
148192016-02-012019-12-1846.5534251.67549440.537017422.9643362 4.6183501.8688206
149892019-02-012019-12-1310.3561643.47619050.289682544.0555556 3.6693121.8346561
169642019-02-012019-12-2010.5863013.02277430.472308494.6286232 4.9120080.7556936
116622017-01-232019-12-1334.6520551.12547440.519449722.9146901 2.6838240.0000000
139582017-04-172019-12-1231.8575342.60534920.721964223.5156519 2.8564670.0000000
176042017-04-172019-12-0531.6273971.42281700.411036043.6360880 3.1301980.6007450
196162016-02-012019-12-1746.5205481.97762070.429917554.1272085 3.2673730.3439340
187642016-02-012018-12-1734.5205480.81111110.579365083.5051587 4.8666670.4055556
100112017-04-172019-12-0931.7589040.22041060.062974474.1248275 3.4950830.0000000
167242016-02-012019-04-0838.2027400.00000000.444994263.2196644 3.9525960.0785284
137482017-04-172019-12-1331.8904112.00687292.289089353.7315292 3.4806700.0000000
171842016-02-012019-12-1946.5863012.23241593.262761704.4219007 3.1125030.0000000
158572018-10-152019-12-1614.0383562.92056994.274004682.7068696 2.7068700.0000000
185062017-01-232019-12-1734.7835621.78245123.536153122.9611689 2.7024260.0000000
114192016-02-012019-12-1246.3561640.56087471.984633573.3221040 1.6179082.3945035
174162017-10-022019-12-2026.5972602.66944790.601565724.1357643 2.2558712.7446436
114912016-02-012019-12-2046.6191781.05106964.182827934.5903855 9.3523749.6741302
144572016-02-012019-12-2046.6191785.55565353.474964741.8661848 9.7170319.9315350
164072016-02-012016-12-1910.5863010.00000006.990165631.133540411.1464800.0944617
108132017-04-172019-12-1732.0219180.90562973.279004112.9667180 7.2762663.1853183
164062016-02-012019-12-1946.5863012.83345105.323453314.1428487 6.9119032.2109504
108842016-02-012019-12-0446.0931514.40412511.974262962.5383381 7.3112823.7098787
130242016-02-012019-12-0946.2575345.77203273.221096902.2482824 4.9937812.9400616
183152016-02-012018-07-0228.9972603.44860172.724395312.2415911 3.6555183.2071995
115792019-02-082019-12-1310.1260273.06141770.000000000.0000000 9.2830090.0000000
153012019-05-032019-12-12 7.3315072.31875930.000000000.9547833 6.5470850.1363976
105492017-07-102019-12-2029.3589044.46201940.442795821.0558977 5.0410600.0000000
187472019-02-042019-12-1110.1917813.82661290.294354840.9811828 6.2795700.0000000
124202019-01-292019-12-1910.6520550.00000000.751028810.8449074 4.6000510.5632716
135062018-10-152019-12-1313.9397262.22386013.156446541.0760613 5.0933570.0000000

Export the data:

In [88]:
five_a<- hub_pa_agg_wide [cluster$order,]
write.table(five_a, "five_a.xls", sep="|")

Arrange the original DF in order of HCLUST similarity

In [89]:
plot(cluster, main= "Program Activity Types Per Month Similarity Clustering", cex=1, hang = -1, axes = FALSE)

Slide 16B: Similarity between hubs based on average attendace per PA

In [90]:
Datathon <- read.csv("~/Google Drive/Data Science/Datathon for Social Good/datathon-social-good-master/Datathon data/Internal data track/Datathon.csv")
In [91]:
### Get a long table for all SRS for all Hubs
hub_pa_agg <- Datathon %>% filter(CategoryName == "Programmed Activities" & (ChildParticipants != 0 | AdultParticipants !=0 )) %>% group_by(HubRandomID, ShortName ) %>% summarise(
    cnt = n(),
    cnt_child = sum(ChildParticipants),
    cnt_adult = sum(AdultParticipants),
    avg_cnt_pa = ((cnt_child + cnt_adult) / cnt), # avg attendance per session
    avg_child_pa = cnt_child/cnt, # average child attendance per session
    avg_adult_pa = cnt_adult/cnt) # average adult attendance per session

# get min and max dates for each hub and join back to table
#hub_dts <- Datathon %>% group_by(HubRandomID) %>% summarise(min_dt = min(ActivityDate),
#                                                            max_dt = max(ActivityDate),
#                                                            duration = max_dt - min_dt
#                                                           ) 
# Join maxmin dates back to table:
hub_pa_agg <- hub_pa_agg %>% inner_join(hub_dts, by= c("HubRandomID" = "HubRandomID"))
# Develop new metrics - avg per month:
#hub_pa_agg <- hub_pa_agg %>% mutate(duration_mnth = as.numeric((duration/365)*12), 
#                        avg_cnt_mnth = cnt/ duration_mnth                      
#                       )
# Develop wide table with column for each SR based on avg cnt per month
hub_pa_agg_wide <- hub_pa_agg %>% select(HubRandomID, ShortName,avg_cnt_pa) %>% pivot_wider(names_from = ShortName, values_from = c(avg_cnt_pa), values_fill = list( avg_cnt_pa = 0))
In [92]:
data_clust <- hub_pa_agg_wide %>% ungroup() %>%select(-HubRandomID)  ### Columns to exclude
row.names(data_clust) <- hub_pa_agg_wide$HubRandomID  #### Row names for the HCLUST
data_clust_dist <- dist(data_clust)  #### This creates a distance matrix from each column
cluster <- hclust(data_clust_dist)   ##### This is the command to create the heirchical cluster
dend <- as.dendrogram(cluster) #### This is the function to create the dendrogram to display different style diagram
#### Several different ways to present the heirarchical diagram:
plot(cluster, main= "Program Activity Types Per Month Similarity Clustering", cex=1, hang = -1, axes = FALSE)
#plot(dend, main= "MSO vs Condition Clustering", cex=0.9, hang = -1, axes = FALSE)
#plot(dend, hang = -1, type = "triangle")
hub_pa_agg_wide [cluster$order,]
Warning message:
“Setting row names on a tibble is deprecated.”
A grouped_df: 80 × 6
HubRandomIDEnglishClassesLifestyleParentsPlaygroupsChildLit
<int><dbl><dbl><dbl><dbl><dbl>
1760414.40000024.38461527.078261119.24242439.631579
10884 7.37931099.76923145.700855 37.31157328.116959
1236913.04955083.58144853.184211 43.13013734.948819
1083310.61290352.16981158.050251 33.70250929.879781
1081321.10344866.82857174.336842 25.12875523.068627
1109823.25229471.80341961.979042 28.24675323.103627
11419 5.76923150.47826118.772727 37.89333335.054054
1948334.44019157.03312141.918728 47.76119429.604027
1493515.82310555.19260738.597701 13.61111128.700000
18335 8.06666757.17052035.297398 14.147239 9.666667
11227 9.68265744.34801825.111111 21.43050215.166667
1359010.37777845.83185835.222222 20.85539218.487395
1408210.14479668.01669812.647273 16.03800511.301170
1760014.03472283.70000010.953333 21.50000017.916667
12394 8.62037056.484848 7.797909 25.356725 0.000000
1396613.02666748.96797215.945813 24.95798318.225000
1157916.322581 0.000000 0.000000 19.531915 0.000000
11662 5.205128 7.722222 7.673267 22.784946 0.000000
17184 3.076923 5.414474 6.043689 26.337931 0.000000
10932 0.000000 0.000000 0.000000 0.00000011.500000
16315 0.000000 6.452381 6.600000 0.000000 0.000000
15301 3.000000 0.000000 7.857143 11.687500 5.000000
13506 6.387097 4.727273 5.866667 8.647887 0.000000
15978 6.692308 2.833333 7.111111 7.625000 2.000000
1813910.80722928.158333 7.758242 14.17460314.294118
1881212.65909120.00000010.750000 14.72413824.075000
16512 3.60000015.568966 3.315789 13.02083310.731707
16724 0.000000 8.352941 7.902439 15.02649014.333333
11818 5.53571422.02395218.207254 23.51079112.842105
19713 5.70895523.88235313.182927 23.58536612.518519
18764 6.392857 21.2000013.89256241.20833 19.357143
12420 0.000000 13.0000014.55555628.57143 24.666667
15120 7.968750 18.6875018.54663227.74131 23.640000
1599413.589852 17.1132121.33204622.91264 27.633333
1183310.741176 18.4705917.09677416.52577 39.500000
1696411.500000 8.40000 5.69387817.36538 27.500000
15669 0.000000 38.0989635.43939419.70115 54.514286
1699915.171533 18.4766414.73271924.42125 59.000000
13748 8.187500 35.7397340.85714317.45045 0.000000
1753410.243243 39.8954128.36529717.20000 0.000000
1815411.554545 32.53463 9.36492927.68151 10.617647
1850611.435484 28.4715415.34951520.76596 0.000000
1422611.645161 32.5488726.56164410.83759 14.222222
1965014.237037 32.9727333.94736817.78049 12.038462
1376516.544118 28.3095225.66423427.15000 10.100000
1741613.830986 30.4375021.30909121.15000 20.383562
1133214.141994 24.6833321.35233218.84536 13.836735
1927711.526846 26.9842520.44312824.19000 15.458333
16538 7.919075 38.6800056.15506327.87456 10.200000
1445711.918919 19.4814859.91954018.70640 15.593952
1640611.022727 22.3830658.04663213.03727 16.097087
17182 7.000000 19.9542036.29255334.39130 27.714286
1831513.040000 12.1265832.49230826.05660 23.860215
1118812.148810 20.3834654.38650350.40625 20.285714
1351616.650000 18.4738349.58404632.43972 18.256024
1731915.798206 70.0461520.93877632.13265170.000000
1302422.235955148.0067155.00000034.03030 17.727941
1892211.544715162.2482454.41666728.47093 33.179272
1149166.755102130.8923117.59813115.48624 18.598670
1963512.689873114.45878 7.75935813.36842 9.701493

Export the data:

In [93]:
five_b<- hub_pa_agg_wide [cluster$order,]
#write.table(five_b, "five_b.xls", sep="|")